Site Reliability Engineer DevOps

Remote
Full Time
Experienced

SITE RELIABILITY ENGINEER 

SUMMARY: 

Since 2006, PEX has been building and evolving a solution that improves the way organizations operate, making them more efficient, more nimble, and more competitive.

PEX has evolved into a robust, secure SaaS solution with a deep suite of workforce spend management capabilities, advanced card controls, real-time visibility into card usage, and improved reconciliation processes. More importantly, we are providing a better, more effective solution for thousands of companies and hundreds of thousands of people in the workforce. We work each day to find new ways we can help our clients operate more efficiently. 

Our environment is a mix of Windows and Linux machines that reside on-premises and in the cloud. All work must be performed in strict adherence to PCI DSS requirements, and our environment must be available 24/7.

WHO YOU ARE: 

As a Site Reliability Engineer, you will be responsible for planning and production work, and for engaging with software developers and infrastructure engineers to integrate software development and delivery.

WHAT YOU’LL DO: 

Architectural oversight and ownership of the web delivery stack, from the server/service to the end user

Continuous improvement of system and application monitoring and automation

Ensuring sufficient monitoring of infrastructure, systems, and application availability, performance, and capacity 

Ensuring sufficient monitoring of the availability, latency, scalability, and efficiency of all services 

Promoting availability and stability in a 24/7 high-availability environment

Participating in an on-call rotation

REQUIRED SKILLS & QUALIFICATIONS:

Strong experience with Linux and at least one programming language (e.g. Python, Go, Ruby) 

Experience with containerization and orchestration technologies such as Docker and Kubernetes 

Experience with cloud infrastructure (e.g. Azure, AWS, GCP), Infrastructure-as-Code tooling (e.g. Terraform), and CI/CD practices

Familiarity with monitoring, tracing, and logging tools (e.g. Zabbix, Sumo Logic) and with concepts such as SLIs/SLOs and error budgets

Strong problem-solving skills and ability to troubleshoot complex issues

Strong communication skills and ability to work well in a team 

Experience with incident management and incident response 

Strong understanding of networking protocols and concepts 

Understanding of security concepts and best practices 

Strong understanding of system performance metrics and how to interpret them

Ability to work independently as well as part of a team


 
